
3635: Fixes a memory leak issue in the Avro C library under abnormal data scenarios. #31

Open
martin-augment wants to merge 1 commit into main from pr-3635-2026-01-27-12-58-52

Conversation

@martin-augment
Owner

3635: To review by AI

@github-actions github-actions bot added the C label Jan 27, 2026
@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical memory leak vulnerability within the Avro C library, specifically targeting scenarios involving malformed or abnormal input data. By introducing a mechanism to determine the maximum readable buffer size and a new safe read macro that ensures proper memory deallocation upon read failures, the changes significantly enhance the library's robustness and prevent resource exhaustion when processing potentially malicious or corrupted data streams.

Highlights

  • Introduction of avro_max_read function: This new function, declared in avro/io.h and implemented in io.c, calculates the maximum number of bytes that can be safely read from an Avro reader, supporting both memory and file-based I/O.
  • Creation of AVRO_SAFE_READ macro: A new macro in encoding.h that wraps the avro_read function. It adds a crucial safety mechanism to free allocated memory (buf) and set it to NULL if the read operation fails, thereby preventing memory leaks.
  • Enhanced byte and string reading: The read_bytes and read_string functions in encoding_binary.c now utilize the AVRO_SAFE_READ macro. Additionally, read_string incorporates a new check using avro_max_read to validate that the declared string length does not exceed the available buffer size, mitigating potential buffer overflow risks.
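The pre-read bounds check described in the highlights can be sketched as follows. This is an illustrative sketch, not the library's code: the reader is reduced to a plain struct so the snippet compiles on its own, `max_read()` plays the role of the PR's `avro_max_read()`, and `check_str_len()` is a hypothetical helper.

```c
#include <errno.h>
#include <stdint.h>

/* Sketch of the bounds check: a memory reader reduced to the two
 * fields the check needs. max_read() stands in for avro_max_read(). */
typedef struct { int64_t len, read; } mem_reader;

static int64_t max_read(const mem_reader *r)
{
	return r->len - r->read;	/* bytes still available to read */
}

/* Reject a declared string length that exceeds the available bytes,
 * before any allocation happens. */
static int check_str_len(const mem_reader *r, int64_t str_len)
{
	int64_t max_available = max_read(r);
	if (max_available >= 0 && str_len > max_available)
		return ERANGE;	/* claimed length exceeds the buffer */
	return 0;
}
```

Rejecting the length before calling the allocator is what prevents malformed input from triggering huge allocations.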
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai

coderabbitai bot commented Jan 27, 2026

Walkthrough

The changes add a new public API function avro_max_read() to query available bytes for reading in the Avro C library. A corresponding macro AVRO_SAFE_READ() wraps avro_read calls to handle errors by freeing allocated memory and nullifying pointers. The binary encoding functions for reading strings and bytes are updated to use this macro and perform bounds checking before reading data. Implementation and declarations are added to io.c and io.h, with the macro defined in encoding.h.


@claude

claude bot commented Jan 27, 2026

Pull Request Review: Memory Leak Fix in Avro C Library

Summary

This PR addresses a memory leak issue in the Avro C library when handling abnormal/malformed data. The fix adds bounds checking and safe cleanup mechanisms for string and bytes reading operations.


Code Quality & Best Practices

✅ Strengths

  1. Good defensive programming: The addition of bounds checking prevents buffer overruns when malformed data claims a string/bytes length larger than available buffer space.

  2. Consistent error handling: The new AVRO_SAFE_READ macro follows the existing pattern established by AVRO_READ, AVRO_WRITE, and AVRO_SKIP macros.

  3. Clear error messages: The added error message in read_string (line 196-197 in encoding_binary.c) provides helpful diagnostic information.

⚠️ Issues to Address

1. Code Style: Inconsistent Spacing (lang/c/src/encoding_binary.c:139-141)

	}
	
	(*bytes)[*len] = '\0';

There's an unnecessary blank line after the closing brace. The existing codebase style doesn't use blank lines in this context.

Recommendation: Remove the blank line at line 139.

2. Logic Issue: Null Terminator Before Read (lang/c/src/encoding_binary.c:140-141)

(*bytes)[*len] = '\0';
AVRO_SAFE_READ(reader, *bytes, *len,  *len+1);

The null terminator is set BEFORE the read operation, so if AVRO_SAFE_READ fails the write was executed needlessly. Setting the terminator after a successful read gives clearer cleanup ordering.

Recommendation: Move the (*bytes)[*len] = '\0'; assignment to after the AVRO_SAFE_READ call. Note that read_string (lines 207-208) has the same ordering issue.

3. Spacing Issue in Macro Definition (lang/c/src/encoding_binary.c:141)

AVRO_SAFE_READ(reader, *bytes, *len,  *len+1);

There's a double space before *len+1. This should be a single space for consistency.

4. Comment Style (lang/c/src/encoding_binary.c:193)

    //  max := r.tail - r.head + 1; if max >= 0 && size > max

The codebase uses C-style comments (/* */) throughout, not C++-style comments (//). This breaks consistency.

Recommendation: Change to:

/* max := r.tail - r.head + 1; if max >= 0 && size > max */

5. Indentation Issue (lang/c/src/encoding_binary.c:193-199)

The comment and subsequent code block use leading spaces for indentation, but the rest of the file uses tabs (as evidenced by the consistent indentation throughout the codebase).

Recommendation: Use tabs for indentation to match the existing code style.


Potential Bugs & Issues

🔴 Critical: NULL Pointer Dereference in AVRO_SAFE_READ (lang/c/src/encoding.h:102-103)

#define AVRO_SAFE_READ(reader, buf, len, mem_size)  \
{ int rval = avro_read( reader, buf, len ); if(rval) { if(buf) avro_free(buf, mem_size); buf = NULL;  return rval; } }

Issue: The macro sets buf = NULL, but this only affects the local macro variable, not the caller's pointer. This means:

  1. Memory leak still occurs: The allocated memory is freed, but the caller's pointer still holds the dangling pointer address
  2. Undefined behavior risk: The caller may attempt to use the dangling pointer

Example problematic flow in read_bytes:

*bytes = (char *) avro_malloc(*len + 1);  // *bytes points to allocated memory
AVRO_SAFE_READ(reader, *bytes, *len, *len+1);  
// If read fails: memory is freed, but *bytes still contains the old address (dangling pointer)
// The macro's "buf = NULL" only affects the macro's local copy, not *bytes

Recommendation: The macro cannot safely NULL the caller's pointer. Instead:

Option 1 (Preferred): Make callers responsible for cleanup:

#define AVRO_SAFE_READ(reader, buf, len, mem_size)  \
{ int rval = avro_read( reader, buf, len ); if(rval) { avro_free(buf, mem_size); return rval; } }

Then in read_bytes:

*bytes = (char *) avro_malloc(*len + 1);
if (!*bytes) {
    avro_set_error("Cannot allocate buffer for bytes value");
    return ENOMEM;
}
int rval = avro_read(reader, *bytes, *len);
if (rval) {
    avro_free(*bytes, *len + 1);
    *bytes = NULL;  // Explicitly NULL the caller's pointer
    return rval;
}
(*bytes)[*len] = '\0';

Option 2: Don't use a macro; use inline functions or expand the cleanup code in each location.


Performance Considerations

✅ Good Performance Characteristics

  1. Minimal overhead: The new avro_max_read() function (lang/c/src/io.c:278-288) has O(1) complexity and adds negligible overhead.

  2. Early validation: Checking buffer size before allocation prevents wasted allocations for invalid data.

📝 Minor Observation

avro_max_read() is called for read_string but not for read_bytes. For consistency, and to prevent oversized allocations in the bytes-reading path as well, consider adding the same check to read_bytes.

Current state:

  • read_string: Has bounds checking ✅
  • read_bytes: No bounds checking ❌

Recommendation: Add similar bounds checking to read_bytes:

static int read_bytes(avro_reader_t reader, char **bytes, int64_t * len)
{
	int64_t max_available = -1;
	int rval;
	check_prefix(rval, read_long(reader, len),
		     "Cannot read bytes length: ");
	if (*len < 0) {
		avro_set_error("Invalid bytes length: %" PRId64, *len);
		return EINVAL;
	}
	
	/* Check if claimed length exceeds available buffer */
	max_available = avro_max_read(reader);
	if (max_available >= 0 && *len > max_available) {
		avro_set_error("mem io: Bytes length %" PRId64 " is greater than available buffer size %" PRId64,
				*len, max_available);
		return ERANGE;
	}
	
	*bytes = (char *) avro_malloc(*len + 1);
	if (!*bytes) {
		avro_set_error("Cannot allocate buffer for bytes value");
		return ENOMEM;
	}
	AVRO_SAFE_READ(reader, *bytes, *len, *len+1);
	(*bytes)[*len] = '\0';	/* terminate only after a successful read */
	return 0;
}

Security Concerns

✅ Security Improvements

  1. DoS prevention: The bounds checking prevents attackers from triggering large memory allocations with malformed data claiming huge string lengths.

  2. Memory exhaustion protection: By validating against available buffer size, the code prevents memory exhaustion attacks.

🔒 Remaining Considerations

  1. Integer overflow: The expression *len + 1 in the malloc call could theoretically overflow if *len == INT64_MAX. However, this is already mitigated by the bounds checking against available buffer size.

  2. Error message information disclosure: The error messages reveal internal buffer sizes. This is generally acceptable for a library but could theoretically aid attackers in crafting exploits. This is a very minor concern.
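The *len + 1 hazard noted above can also be guarded explicitly. The helper below is a sketch under stated assumptions: check_alloc_size() is a hypothetical name, not part of the Avro C API, and it simply rejects lengths for which the +1 would wrap.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical guard: reject lengths where len + 1 would overflow
 * int64_t before the sum reaches an allocator. Not Avro API. */
static int check_alloc_size(int64_t len)
{
	if (len < 0 || len == INT64_MAX)
		return EINVAL;	/* len + 1 would be negative or wrap */
	return 0;
}
```

A caller would run this check right after decoding the length, before computing the allocation size.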


Test Coverage

❌ Missing Test Coverage

Critical issue: There are no tests included in this PR to verify:

  1. The memory leak is actually fixed
  2. The new bounds checking works correctly
  3. Error conditions are handled properly
  4. No regressions are introduced

Recommendation: Add tests that:

  1. Test malformed data handling: Create test cases with strings/bytes claiming lengths larger than the buffer

    // Pseudo-code example
    test_string_length_exceeds_buffer() {
        char buffer[10] = {0x14, ...}; // Varint encoding for length=10 but buffer has only 2 bytes
        avro_reader_t reader = avro_reader_memory(buffer, 2);
        char *str;
        int64_t len;
        int result = read_string(reader, &str, &len);
        assert(result == ERANGE);
        assert(str == NULL); // Verify no leak
    }
  2. Memory leak detection: Run tests under Valgrind or AddressSanitizer to verify leaks are fixed

  3. Regression tests: Ensure normal operation still works correctly

  4. Boundary cases: Test edge cases like empty strings, maximum valid lengths, etc.


API Changes

⚠️ New Public API

The function avro_max_read() is added to the public header lang/c/src/avro/io.h:55. This expands the library's public API surface and commits the project to maintaining it.

Considerations:

  1. Documentation needed: This function needs documentation comments explaining its purpose and return values
  2. Naming convention: Consider whether this follows Avro's API naming conventions
  3. API stability: Will this be part of the stable API going forward?

Recommendation: Add Doxygen-style documentation:

/**
 * Returns the maximum number of bytes that can be read from the reader.
 * 
 * @param reader The reader to query
 * @return The number of bytes available, or -1 if unknown/unlimited
 */
int64_t avro_max_read(avro_reader_t reader);

Summary & Recommendations

Required Changes

  1. Fix the critical bug in AVRO_SAFE_READ macro - The NULL assignment doesn't work as intended
  2. Add test coverage - This is a memory safety fix and must have tests
  3. Fix code style issues - Comments, spacing, indentation
  4. Apply bounds checking to read_bytes - For consistency and completeness

Nice-to-Have Improvements

  1. Add API documentation for avro_max_read()
  2. Consider whether avro_max_read() should be a public API or internal helper
  3. Run the test suite under memory sanitizers to verify the fix

Estimated Risk

  • Current risk level: Medium-High (due to the AVRO_SAFE_READ bug)
  • After fixes: Low

The core idea of adding bounds checking is sound and addresses a real security/stability issue. However, the implementation has a critical bug that needs to be fixed before merging.


Overall Assessment: The PR addresses a legitimate issue but needs revisions before merging. The approach is good, but the execution has bugs that could cause the same memory leaks it's trying to prevent.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to address a memory leak issue in the Avro C library by introducing robust error handling and memory management, specifically through the new avro_max_read function and AVRO_SAFE_READ macro. However, the current implementation has critical security and functional issues. The read_bytes function is missing OOM protection and is vulnerable to a heap buffer overflow via integer overflow on 32-bit systems. Furthermore, the avro_max_read implementation for file IO is broken, which will cause the library to fail when reading strings longer than 4KB from files, leading to a significant regression and denial of service.

AVRO_READ(reader, *bytes, *len);

(*bytes)[*len] = '\0';
AVRO_SAFE_READ(reader, *bytes, *len, *len+1);


security-critical

The read_bytes function is missing the avro_max_read check that was added to read_string. This allows an attacker to trigger large memory allocations by providing a large length in the binary data, leading to Denial of Service (OOM). Additionally, the expression *len + 1 used in the allocation (line 134) and the AVRO_SAFE_READ macro (line 141) is susceptible to integer overflow. On 32-bit systems, if *len is INT64_MAX, *len + 1 wraps to 0, leading to a small allocation followed by an out-of-bounds write at line 140.

Copy link
Copy Markdown
Owner Author


value:useful; category:bug; feedback:The Gemini AI reviewer is correct! The addition of the extra null byte is problematic: if it is added before the safe read, it may lead to integer overflow when len == INT64_MAX, and if the safe read fails, the appended \0 was written needlessly. Adding it after the safe read is safer because it avoids both issues.

		return mem_reader->len - mem_reader->read;
	} else if (is_file_io(reader)) {
		struct _avro_reader_file_t *file_reader = avro_reader_to_file(reader);
		return bytes_available(file_reader);


security-high

The implementation of avro_max_read for file-based readers is incorrect. It returns the number of bytes currently in the internal buffer (reader->end - reader->cur), which is at most 4096 bytes. This causes read_string to incorrectly reject any string longer than 4KB with an ERANGE error, even if the file contains the full string. For file readers where the total size is not readily available in the buffer, this should return -1 to bypass the length check and avoid a denial of service for legitimate data.

return -1;

Comment on lines +102 to +103
#define AVRO_SAFE_READ(reader, buf, len, mem_size) \
{ int rval = avro_read( reader, buf, len ); if(rval) { if(buf) avro_free(buf, mem_size); buf = NULL; return rval; } }


medium

The AVRO_SAFE_READ macro introduces complex logic, including conditional memory deallocation and an early return. While effective for safety, macros with control flow can sometimes lead to unexpected behavior, make debugging more challenging, and reduce readability compared to a static inline function. A function would provide better type checking and scope management, reducing the risk of misuse.

Consider encapsulating this logic in a static inline function for improved maintainability and robustness, if the project's coding standards allow for it.
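A static inline alternative along the lines suggested might look like the sketch below. Everything here is an assumption for illustration: the reader and avro_read() are stubbed with a plain struct and memcpy so the snippet stands alone, and safe_read() is a hypothetical name, not a proposed library symbol.

```c
#include <stdlib.h>
#include <string.h>

/* Stub reader standing in for avro_reader_t so the sketch compiles
 * without the library. */
typedef struct { const char *data; size_t len, pos; } stub_reader;

/* Stub for avro_read(): fails when fewer than len bytes remain. */
static int stub_read(stub_reader *r, void *buf, size_t len)
{
	if (len > r->len - r->pos)
		return 1;
	memcpy(buf, r->data + r->pos, len);
	r->pos += len;
	return 0;
}

/* Inline replacement for AVRO_SAFE_READ: on failure it frees *buf and
 * nulls the caller's pointer, and the caller decides whether to
 * return, instead of the macro hiding a `return` in its expansion.
 * The real code would call avro_free(*buf, mem_size) instead of free. */
static inline int safe_read(stub_reader *r, char **buf, size_t len, size_t mem_size)
{
	int rval = stub_read(r, *buf, len);
	if (rval) {
		(void) mem_size;
		free(*buf);
		*buf = NULL;
	}
	return rval;
}
```

Because the failure path no longer returns on the caller's behalf, the function composes with whatever cleanup the caller still has to do.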

@augmentcode

augmentcode bot commented Jan 27, 2026

🤖 Augment PR Summary

Summary: Hardens Avro C binary decoding to avoid leaks and improve safety under abnormal/corrupt input scenarios.

Changes:

  • Adds avro_max_read() to report remaining readable bytes for memory/file-backed readers.
  • Introduces AVRO_SAFE_READ to free allocated buffers when avro_read() fails.
  • Updates read_bytes()/read_string() to use safe reads and adds a pre-check for oversized strings.

Technical Notes: This aims to prevent heap buffers allocated for strings/bytes from leaking when read operations error out.



@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


		return mem_reader->len - mem_reader->read;
	} else if (is_file_io(reader)) {
		struct _avro_reader_file_t *file_reader = avro_reader_to_file(reader);
		return bytes_available(file_reader);


In avro_max_read(), the file-IO branch returns bytes_available(file_reader), which only reflects the current internal buffer (and can be 0/undefined before the first fill). Callers like read_string() can then incorrectly reject valid reads when the requested size exceeds the currently buffered bytes.

Other Locations
  • lang/c/src/encoding_binary.c:194


Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! The new functionality is not the same for memory-based reads and file-based ones: for in-memory reads it returns the total number of bytes that can still be read, while for files it returns only the unread bytes currently in the buffer, not the bytes remaining to be read from the file itself.

		return EINVAL;
	}
	// max := r.tail - r.head + 1; if max >= 0 && size > max
	max_available = avro_max_read(reader);


While adding the max_available guard, consider also guarding against overflow in the str_len + 1 size computation/allocation. With a very large str_len from corrupted input, signed overflow or a wrap when converting to size_t could lead to an undersized allocation and an out-of-bounds read/write.


Owner Author


value:useful; category:bug; feedback:The Augment AI reviewer is correct! The addition of the extra null byte is problematic: if it is added before the safe read, it may lead to integer overflow when len == INT64_MAX, and if the safe read fails, the appended \0 was written needlessly. Adding it after the safe read is safer because it avoids both issues.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@lang/c/src/encoding_binary.c`:
- Around line 193-199: The comment above the max_available check is inaccurate
and the error message wrongly prefixes "mem io:" even though the check is used
for all reader types; update the comment to reflect the actual computation used
by avro_max_read (e.g., "max := len - read for memory / end - cur for file"
without "+1") and change the avro_set_error call in the block using
max_available, str_len and avro_max_read to use a generic prefix such as "io:"
or "reader:" (remove "mem io:") so the message correctly reflects both memory
and file readers.

In `@lang/c/src/io.c`:
- Around line 277-288: avro_max_read currently returns only buffered bytes for
file readers (via bytes_available), causing read_string bounds checks to fail
for valid reads; change the file branch in avro_max_read (is_file_io /
avro_reader_to_file) to compute total remaining bytes in the underlying file
instead of just the buffer by using the file's size and current absolute read
position (e.g., use fseeko/ftello or equivalent on file_reader->file to get
file_size and current_offset and return file_size - current_offset +
bytes_available(file_reader)), so callers like read_string in encoding_binary.c
get the true remaining byte count.
🧹 Nitpick comments (3)
lang/c/src/avro/io.h (1)

55-55: Consider adding documentation for the new public API.

The new avro_max_read function is a public API addition. Adding a brief comment describing its behavior (returns available bytes for memory readers, buffered bytes for file readers, or -1 for unknown types) would improve maintainability and help consumers understand the semantics.

📝 Suggested documentation
+/*
+ * Returns the number of bytes available to read from the reader.
+ * For memory readers, returns remaining bytes (len - read).
+ * For file readers, returns buffered bytes available.
+ * Returns -1 for unknown reader types.
+ */
 int64_t avro_max_read(avro_reader_t reader);
lang/c/src/encoding_binary.c (2)

139-142: AVRO_SAFE_READ correctly prevents memory leak on read failure.

The change properly frees the allocated buffer if avro_read fails, fixing the memory leak. Note that read_bytes does not have the same avro_max_read pre-check that read_string has—consider adding it for consistency to prevent large allocations on invalid input.

♻️ Optional: Add bounds check for consistency with read_string
 static int read_bytes(avro_reader_t reader, char **bytes, int64_t * len)
 {
 	int rval;
+	int64_t max_available;
 	check_prefix(rval, read_long(reader, len),
 		     "Cannot read bytes length: ");
 	if (*len < 0) {
 		avro_set_error("Invalid bytes length: %" PRId64, *len);
 		return EINVAL;
 	}
+	max_available = avro_max_read(reader);
+	if (max_available >= 0 && *len > max_available) {
+		avro_set_error("Bytes length %" PRId64 " exceeds available buffer %" PRId64,
+			*len, max_available);
+		return ERANGE;
+	}
 	*bytes = (char *) avro_malloc(*len + 1);

193-199: Inconsistent indentation style.

Lines 193-199 use spaces for indentation while the rest of the file uses tabs. This should be aligned with the file's existing style.

Comment on lines +193 to +199
	// max := r.tail - r.head + 1; if max >= 0 && size > max
	max_available = avro_max_read(reader);
	if (max_available >= 0 && str_len > max_available) {
		avro_set_error("mem io: String length %" PRId64 " is greater than available buffer size %" PRId64,
				str_len, max_available);
		return ERANGE;
	}


⚠️ Potential issue | 🟡 Minor

Fix inaccurate comment and misleading error message.

  1. Line 193: The comment formula max := r.tail - r.head + 1 doesn't match the implementation (len - read for memory, end - cur for file—no +1).
  2. Line 196: The error message prefix "mem io:" is misleading since this check applies to all reader types (memory and file).
📝 Suggested fix
-    //  max := r.tail - r.head + 1; if max >= 0 && size > max
+	// Bounds check: reject if string length exceeds available bytes
 	max_available = avro_max_read(reader);
 	if (max_available >= 0 && str_len > max_available) {
-	    avro_set_error("mem io: String length %" PRId64 " is greater than available buffer size %" PRId64,
+		avro_set_error("String length %" PRId64 " exceeds available buffer %" PRId64,
 				str_len, max_available);
 		return ERANGE;
 	}

Comment on lines +277 to +288

int64_t avro_max_read(avro_reader_t reader)
{
	if (is_memory_io(reader)) {
		struct _avro_reader_memory_t *mem_reader = avro_reader_to_memory(reader);
		return mem_reader->len - mem_reader->read;
	} else if (is_file_io(reader)) {
		struct _avro_reader_file_t *file_reader = avro_reader_to_file(reader);
		return bytes_available(file_reader);
	}
	return -1;
}


⚠️ Potential issue | 🟠 Major


File reader returns only buffered bytes, not total remaining bytes.

For file-based readers, bytes_available(file_reader) returns only the bytes currently in the internal buffer (reader->end - reader->cur), not the remaining bytes in the file. This creates asymmetric behavior:

  • Memory readers: returns total remaining bytes (mem_reader->len - mem_reader->read)
  • File readers: returns only buffered bytes (potentially zero if buffer is empty)

When read_string in encoding_binary.c (line 194-197) uses this for bounds checking, it incorrectly rejects valid reads from file-based readers when the requested string length exceeds the current buffer size, even though avro_read_file has logic to refill the buffer from the file and fulfill such reads.

Note: The error message in read_string says "mem io", indicating the check was intended for memory-based IO only, but it applies equally to file-based readers.


Owner Author


value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The new functionality is not the same for memory-based reads and file-based ones: for in-memory reads it returns the total number of bytes that can still be read, while for files it returns only the unread bytes currently in the buffer, not the bytes remaining to be read from the file itself.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


		return mem_reader->len - mem_reader->read;
	} else if (is_file_io(reader)) {
		struct _avro_reader_file_t *file_reader = avro_reader_to_file(reader);
		return bytes_available(file_reader);


File reader incorrectly rejects valid string reads

High Severity

The avro_max_read function for file I/O returns bytes_available(file_reader), which only reflects bytes currently buffered (up to 4096), not total remaining file data. When read_string uses this to validate string length, it incorrectly rejects valid strings from files when str_len exceeds the current buffer contents, even though avro_read_file can read more by refilling from the file. For file readers, the function should return -1 to skip the validation check.
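If, instead of returning -1, the file branch were to report the total remaining bytes, the computation might look like the sketch below. This is an assumption, not the library's code: remaining_bytes() is a hypothetical helper operating on a bare FILE*, and the real fix would also account for the reader's already-buffered bytes.

```c
#include <stdio.h>

/* Sketch (hypothetical, not Avro code): total bytes left in a FILE*
 * from the current position, or -1 when the stream is unseekable, so
 * a file-backed avro_max_read() could report more than one buffer's
 * worth of data. */
static long remaining_bytes(FILE *fp)
{
	long cur = ftell(fp);
	if (cur < 0)
		return -1;	/* unseekable: report "unknown" */
	if (fseek(fp, 0, SEEK_END) != 0)
		return -1;
	long end = ftell(fp);
	fseek(fp, cur, SEEK_SET);	/* restore the read position */
	return (end < 0) ? -1 : end - cur;
}
```

Returning -1 for unseekable streams preserves the "unknown, skip the check" behavior the reviewers recommend.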


Owner Author


value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! The new functionality is not the same for memory-based reads and file-based ones: for in-memory reads it returns the total number of bytes that can still be read, while for files it returns only the unread bytes currently in the buffer, not the bytes remaining to be read from the file itself.

@martin-augment
Owner Author

2. Logic Issue: Null Terminator Before Read (lang/c/src/encoding_binary.c:140-141)

(*bytes)[*len] = '\0';
AVRO_SAFE_READ(reader, *bytes, *len,  *len+1);

The null terminator is set BEFORE the read operation. If AVRO_SAFE_READ fails, this line was executed unnecessarily. While not a bug, it's better to set the null terminator after a successful read for clarity.

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The addition of the extra null byte is problematic: if it is added before the safe read, it may lead to integer overflow when len == INT64_MAX, and if the safe read fails, the appended \0 was written needlessly. Adding it after the safe read is safer because it avoids both issues.

@martin-augment
Owner Author

The avro_max_read() is called for read_string but not for read_bytes. For consistency and to prevent potential memory leaks in the bytes reading path as well, consider adding the same check to read_bytes.

Current state:

  • read_string: Has bounds checking ✅
  • read_bytes: No bounds checking ❌

Recommendation: Add similar bounds checking to read_bytes:

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The newly added safe read functionality is used for reading strings but not for reading bytes. It should be used for bytes too, to prevent unsafe byte reads.

@martin-augment
Owner Author

139-142: AVRO_SAFE_READ correctly prevents memory leak on read failure.

The change properly frees the allocated buffer if avro_read fails, fixing the memory leak. Note that read_bytes does not have the same avro_max_read pre-check that read_string has—consider adding it for consistency to prevent large allocations on invalid input.

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The newly added safe read functionality is used for reading strings but not for reading bytes. It should be used for bytes too, to prevent unsafe byte reads.
